15 May 2019
Determine the need through collaboration
Articulate the need as a question
Scope out the 'minimum viable product'
Allow time for peer review
Communicate the findings in appropriate ways
Data, analysis outputs, and visualisation are stored in one place
Data is easily manipulated - high risk of introducing errors
Point-and-click interface - manual repetition
Actions are not logged
Slow when working with large data sets or files storing many computations
Complex Excel workbooks can 'break' when shared with others
Separation of data, analysis outputs, visualisations, and reports
Reproducible, transparent, and less error-prone workflow
Much faster for repetitive actions than point-and-click
Requires you to think early and often about data structure
Procedures are documented via scripts and command history
Build up a library of repeatable, adaptable procedures
R is a statistical programming language, RStudio is an IDE for using R
Its most fundamental component is 'base R' but it is highly extensible
R is free and open source: users benefit from packages created by others
The R community is global and active - lots of help for when you get stuck!
Demo 1 uses the winter data set downloadable from this link.
It includes all podium results from Winter Olympic Games that have taken place from 1924 to 2014, inclusive.
# The readr package is efficient for reading (importing) of CSV files
library(readr)
# Read the data in
winter <- read_csv("winter.csv", col_names = TRUE, col_types = NULL)
head(winter)
## # A tibble: 6 x 9 ## Year City Sport Discipline Athlete Country Gender Event Medal ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1924 Chamo~ Biath~ Biathlon BERTHET, G. FRA Men Militar~ Bron~ ## 2 1924 Chamo~ Biath~ Biathlon MANDRILLON,~ FRA Men Militar~ Bron~ ## 3 1924 Chamo~ Biath~ Biathlon MANDRILLON,~ FRA Men Militar~ Bron~ ## 4 1924 Chamo~ Biath~ Biathlon VANDELLE, A~ FRA Men Militar~ Bron~ ## 5 1924 Chamo~ Biath~ Biathlon AUFDENBLATT~ SUI Men Militar~ Gold ## 6 1924 Chamo~ Biath~ Biathlon JULEN, Alph~ SUI Men Militar~ Gold
Let's say we are interested in comparing the number of gold medals won by Canada, Norway, and Sweden from the last five Winter Olympic Games.
We need to subset the data to focus only on:
# The dplyr package includes useful data manipulation functions
library(dplyr)
# Subset the data
goldMedalComparison <- winter %>%
filter(Medal == "Gold"
& Country %in% c("CAN", "NOR", "SWE")
& Year >= 1998)
goldMedalComparison
## # A tibble: 358 x 9 ## Year City Sport Discipline Athlete Country Gender Event Medal ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1998 Nagano Biath~ Biathlon BJOERNDAL~ NOR Men 10KM Gold ## 2 1998 Nagano Biath~ Biathlon HANEVOLD,~ NOR Men 20KM Gold ## 3 1998 Nagano Bobsl~ Bobsleigh LUEDERS, ~ CAN Men Two-M~ Gold ## 4 1998 Nagano Bobsl~ Bobsleigh MACEACHER~ CAN Men Two-M~ Gold ## 5 1998 Nagano Curli~ Curling BETKER, J~ CAN Women Curli~ Gold ## 6 1998 Nagano Curli~ Curling FORD, Ati~ CAN Women Curli~ Gold ## 7 1998 Nagano Curli~ Curling GUDEREIT,~ CAN Women Curli~ Gold ## 8 1998 Nagano Curli~ Curling MCCUSKER,~ CAN Women Curli~ Gold ## 9 1998 Nagano Curli~ Curling SCHMIRLER~ CAN Women Curli~ Gold ## 10 1998 Nagano Skati~ Short Track ~ BEDARD, E~ CAN Men 5000M~ Gold ## # ... with 348 more rows
In the winter data set, the data is structured such that one row is one medal-winning athlete.
However…
For this analysis, we need to wrangle the data to get it into a format where one row represents one gold medal per event (rather than per athlete).
# Create a new variable to identify each unique event
goldMedalComparison$uniqueEventID <-
paste(goldMedalComparison$Year, goldMedalComparison$Sport,
goldMedalComparison$Discipline, goldMedalComparison$Country,
goldMedalComparison$Gender, goldMedalComparison$Event, sep="_")
goldMedalComparison %>% select(uniqueEventID)
## # A tibble: 358 x 1 ## uniqueEventID ## <chr> ## 1 1998_Biathlon_Biathlon_NOR_Men_10KM ## 2 1998_Biathlon_Biathlon_NOR_Men_20KM ## 3 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man ## 4 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man ## 5 1998_Curling_Curling_CAN_Women_Curling ## 6 1998_Curling_Curling_CAN_Women_Curling ## 7 1998_Curling_Curling_CAN_Women_Curling ## 8 1998_Curling_Curling_CAN_Women_Curling ## 9 1998_Curling_Curling_CAN_Women_Curling ## 10 1998_Skating_Short Track Speed Skating_CAN_Men_5000M Relay ## # ... with 348 more rows
# Identify duplicates goldMedalComparison$duplicates <- duplicated(goldMedalComparison$uniqueEventID) goldMedalComparison %>% select(uniqueEventID, duplicates)
## # A tibble: 358 x 2 ## uniqueEventID duplicates ## <chr> <lgl> ## 1 1998_Biathlon_Biathlon_NOR_Men_10KM FALSE ## 2 1998_Biathlon_Biathlon_NOR_Men_20KM FALSE ## 3 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man FALSE ## 4 1998_Bobsleigh_Bobsleigh_CAN_Men_Two-Man TRUE ## 5 1998_Curling_Curling_CAN_Women_Curling FALSE ## 6 1998_Curling_Curling_CAN_Women_Curling TRUE ## 7 1998_Curling_Curling_CAN_Women_Curling TRUE ## 8 1998_Curling_Curling_CAN_Women_Curling TRUE ## 9 1998_Curling_Curling_CAN_Women_Curling TRUE ## 10 1998_Skating_Short Track Speed Skating_CAN_Men_5000M Relay FALSE ## # ... with 348 more rows
# Remove duplicates from the data # and remove temporary variables used for identifying duplicates goldMedalComparison <- goldMedalComparison %>% filter(duplicates == FALSE) %>% select(-uniqueEventID, -duplicates)
goldMedalComparison %>% select(Year, Sport, Country, Event)
## # A tibble: 105 x 4 ## Year Sport Country Event ## <dbl> <chr> <chr> <chr> ## 1 1998 Biathlon NOR 10KM ## 2 1998 Biathlon NOR 20KM ## 3 1998 Bobsleigh CAN Two-Man ## 4 1998 Curling CAN Curling ## 5 1998 Skating CAN 5000M Relay ## 6 1998 Skating CAN 500M ## 7 1998 Skating NOR 1500M ## 8 1998 Skating CAN 500M ## 9 1998 Skiing NOR Slalom ## 10 1998 Skiing NOR 10KM Pursuit ## # ... with 95 more rows
The data is now in a format that is ready for calculating summary statistics.
For instance, we can tabulate the total number of gold medals won by Canada, Norway, and Sweden at each of the Winter Games since 1998.
# The reshape2 package is useful for manipulating data
# from long-to-wide format and vice versa
library(reshape2)
# Calculate total number of gold medals per team per Winter Games
goldMedalTotals <- goldMedalComparison %>%
group_by(Year, City, Country) %>%
summarise(goldMedalTotal = length(Medal)) %>%
ungroup()
# Reshape the data from long-to-wide format for tabulating
goldMedalTotalsWide <- dcast(goldMedalTotals,
Year + City ~ Country,
value.var = "goldMedalTotal")
# Check the reshaped data goldMedalTotalsWide
## Year City CAN NOR SWE ## 1 1998 Nagano 6 10 NA ## 2 2002 Salt Lake City 8 12 NA ## 3 2006 Turin 7 2 7 ## 4 2010 Vancouver 15 9 5 ## 5 2014 Sochi 10 12 2
# NAs appear in the data because Sweden did not win any gold medals # at the 1998 and 2002 Winter Games # Replace these NAs with zeroes goldMedalTotalsWide$SWE[is.na(goldMedalTotalsWide$SWE)] <- 0
# The DT package is used to display interactive tables in HTML documents library(DT)
## Warning: package 'DT' was built under R version 3.5.3
# Display the table datatable(goldMedalTotalsWide, options = list(dom = 't'), rownames = FALSE)
We can use the data frame we've just created - gold medal totals per team per Games - to calculate the median number of gold medals won by Canada, Norway, and Sweden at the last five Games.
goldMedalMedians <- goldMedalTotalsWide %>%
summarise(CAN = median(CAN),
NOR = median(NOR),
SWE = median(SWE))
goldMedalMedians
## CAN NOR SWE ## 1 8 10 2
For demo 2, we will use data from season 1 of the AFLW to show how sports data can be visualised using R.
# This data was obtained from this Github repository:
# https://github.com/ropenscilabs/ozwomensport/tree/master/AFLW
aflwTeamStats <- read_csv("teams.csv", col_names = TRUE, col_types = NULL)
head(aflwTeamStats)
## # A tibble: 6 x 13 ## Club Year Kicks Handballs Marks Clearances Frees_For Frees_Agst Tackles ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Adel~ 2017 132. 57.3 31 19.8 19 15.3 55.6 ## 2 Bris~ 2017 122. 52 37.6 16.9 16.9 16.1 54.4 ## 3 Carl~ 2017 111. 60.6 32.7 20.1 11.9 14.1 49.4 ## 4 Coll~ 2017 120. 47.9 39.7 21.7 15.6 13.6 56.6 ## 5 Frem~ 2017 117. 58.4 33.1 22 12.1 16.3 59.7 ## 6 GWS ~ 2017 108. 50.1 29 22.1 14.9 13.6 55.7 ## # ... with 4 more variables: Goals <dbl>, Behinds <dbl>, Disp_eff <dbl>, ## # Goal_acc <dbl>
Let's compare some basic stats across clubs. First, we need to prepare the data.
aflwTeamStats <- aflwTeamStats %>% filter(Club != "All Clubs")
# The ggplot2 package is widely used for plotting in R
# It uses a layer-by-layer approach to building plots
library(ggplot2)
# Sort by average kicks per game
aflwTeamStats <- aflwTeamStats[order(aflwTeamStats$Kicks), ]
# Build the plot
p <- ggplot(aflwTeamStats, aes(x = reorder(Club, -Kicks), y = Kicks, fill = Club))
p <- p + facet_wrap(~Year, nrow = 1)
p <- p + geom_bar(stat = "identity")
p <- p + scale_y_continuous(
limits = c(0,150), breaks = seq(0,150, by = 25))
p <- p + labs(title = "Comparing average kicks per game across AFLW teams",
x = "Club",
y = "Average kicks per game")
p <- p + theme(legend.position = "none",
axis.text.x = element_text(angle = 90, vjust = 0.5))
aflwKicksPlot <- p
This presentation is not a deep dive into statistics, but R is more than capable of running a vast array of stats including:
Try R: An in-the-browser tutorial for getting to know the R language - no software installation needed.
'Introduction to R': Free course on Data Camp.
'R for Data Science', by Garrett Grolemund & Hadley Wickham.
'Data wrangling, exploration, and analysis with R': Course materials from a unit taught at the University of British Columbia by Prof Jenny Bryan.
'R for data analysis and visualisation': Data Carpentry lesson using ecology data
The Data Science Guide: A useful resource for finding content all over the web for learning different data science skills and programming languages. You can tailor your search to your current level of expertise in a given domain.
'R for Excel Users': Blog post by Gordon Shotwell
'Introduction to R for Excel Users': Free ebook by Tom Hopper
Follow the #rstats hashtag on Twitter
Many prominent R users and developers have blogs and / or tweet regularly. Here are some of my favourites:
@JennyBryan | @ma_salmon | @alice_data | @dataandme | @visnut | @kierisi | @thomasp85 | @xieyihui | @drob | @juliasilge
Join, read, and contribute to the R Studio Community discussions: https://community.rstudio.com/
You can discover lots of weird, wonderful, and useful R packages and repositories on Github: https://github.com/trending/r
AFLW data stored on Github
ATP and WTA tennis data, accessible via the R package deuce
Olympic Sports and Medals, 1896-2014 collated by The Guardian
NBA, NFL, MLB, and soccer data from FiveThirtyEight
NFL play-by-play data, accessible via the R package nflscrapR
MLB data from Lahman's Baseball Database
Cricket data, accessible via the R package cricinfo
WNBA player and game data from WNBA.com
AFL match statistics from AFL Tables
AFL draft and trade data from Draft Guru